NSF PAR Search | NSF Public Access Repository

FP-SMR: A Fully Digital Floating-Point Processing-in-SAS-MRAM for Session-based Recommender System

https://doi.org/10.1145/3716368.3735206

Ali, Asmer Hamid; Sridharan, Amitesh; Guo, Cheng; Hwang, William; Tsai, Wilman; Zhang, Jeff; Chen, Yiran; Wang, Shan X.; Fan, Deliang (June 2025, ACM)

With the rapid advancement of DNNs, numerous Processing-in-Memory (PIM) architectures based on various memory technologies, both non-volatile (NVM) and volatile, have been developed to accelerate AI workloads. Magnetic Random Access Memory (MRAM) is highly promising among NVMs due to its zero standby leakage, fast write/read speeds, CMOS compatibility, and high memory density. However, existing MRAM technologies such as spin-transfer torque MRAM (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM) have inherent limitations. STT-MRAM requires high write currents, while SOT-MRAM incurs significant area overhead due to additional access transistors. The new STT-assisted-SOT (SAS) MRAM provides an area-efficient alternative by sharing one write access transistor among multiple magnetic tunnel junctions (MTJs). This work presents the first fully digital processing-in-SAS-MRAM system to enable 8-bit floating-point (FP8) neural network inference, with an application to on-device session-based recommender systems. A SAS-MRAM device prototype is fabricated with 4 MTJs sharing the same SOT metal line. The proposed SAS-MRAM-based PIM macro is designed in TSMC 28nm technology. It achieves 15.31 TOPS/W energy efficiency and 269 GOPS performance for FP8 operations at 700 MHz. Compared to state-of-the-art recommender systems on the same popular YooChoose dataset, it demonstrates 86×, 1.8×, and 1.12× higher energy efficiency than GPU, SRAM-PIM, and ReRAM-PIM, respectively.
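As a rough reading aid, the headline figures in this abstract can be combined into a back-of-envelope estimate. The sketch below assumes that the 15.31 TOPS/W and 269 GOPS figures describe the same operating point at 700 MHz, which the abstract does not state explicitly; the function names are hypothetical and the derived values are not reported by the authors.

```python
# Figures quoted in the abstract (FP8 operations at 700 MHz).
EFFICIENCY_TOPS_PER_W = 15.31
THROUGHPUT_GOPS = 269.0
CLOCK_MHZ = 700.0


def implied_power_mw(throughput_gops, efficiency_tops_per_w):
    """Power in mW implied by throughput / efficiency.

    Assumption: both figures were measured at the same operating point.
    """
    watts = (throughput_gops * 1e9) / (efficiency_tops_per_w * 1e12)
    return watts * 1e3


def energy_per_op_fj(efficiency_tops_per_w):
    """Energy per operation in femtojoules (the reciprocal of ops per joule)."""
    return 1e15 / (efficiency_tops_per_w * 1e12)


def ops_per_cycle(throughput_gops, clock_mhz):
    """Average number of operations completed per clock cycle."""
    return (throughput_gops * 1e9) / (clock_mhz * 1e6)


if __name__ == "__main__":
    print(implied_power_mw(THROUGHPUT_GOPS, EFFICIENCY_TOPS_PER_W))
    print(energy_per_op_fj(EFFICIENCY_TOPS_PER_W))
    print(ops_per_cycle(THROUGHPUT_GOPS, CLOCK_MHZ))
```

Under that assumption the macro would draw roughly 17.6 mW, spend about 65 fJ per operation, and complete about 384 operations per cycle. These are derived estimates only.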
Free, publicly-accessible full text available June 29, 2026
Learning to Prune and Low-Rank Adaptation for Compact Language Model Deployment

https://doi.org/10.1145/3658617.3697648

Ali, Asmer Hamid; Zhang, Fan; Yang, Li; Fan, Deliang (January 2025, ACM)

Parameter-efficient fine-tuning (PEFT) of large pre-trained models (LPMs) for downstream tasks has gained significant popularity, since it can substantially reduce the computational overhead of training. The representative work, LoRA [1], learns a low-rank adaptor for a new downstream task rather than fine-tuning the whole backbone model. However, at inference time the size of the learned model remains unchanged, leading to inefficient inference computation. To mitigate this, we are the first to propose a learning-to-prune methodology specially designed for fine-tuning downstream tasks based on LPMs with low-rank adaptation. Unlike prior low-rank adaptation approaches that only learn the low-rank adaptors for downstream tasks, our method further leverages the Gumbel-Sigmoid trick to learn a set of trainable binary channel-wise masks that automatically prune the backbone LPMs. Our method therefore combines the benefit of low-rank adaptation, a reduced number of training parameters, with a smaller pruned backbone LPM for efficient inference computation. Extensive experiments show that the Pruned-RoBbase model with our method achieves an average channel-wise structured pruning ratio of 24.5% across the popular GLUE Benchmark, coupled with an average 18% inference time speed-up on a real NVIDIA A5000 GPU. The Pruned-DistilBERT shows an average 13% inference time improvement with 17% sparsity. The Pruned-LLaMA-7B model achieves up to 18.2% inference time improvement with 24.5% sparsity, demonstrating the effectiveness of our learnable pruning approach across different models and tasks.
Free, publicly-accessible full text available January 20, 2026
SAFER: Sparsity Integrated Compute-in-Memory AI Accelerator with a Fused Dot-Product Engine and a RISC-V CPU

https://doi.org/10.1109/ESSERC66193.2025.11214049

Sridharan, Amitesh; Ali, Asmer Hamid; Lee, Yongjae; Anupreetham, Anupreetham; Liu, Yaotian; Zhang, Jeff; Seo, Jae-sun; Fan, Deliang (September 2025, IEEE)

Free, publicly-accessible full text available September 8, 2026